Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing
Abstract
Many “big data” applications need to act on data arriving in real time. However, current programming models for distributed stream processing are relatively low-level, often leaving the user to worry about consistency of state across the system and fault recovery. Furthermore, the models that provide fault recovery do so in an expensive manner, requiring either hot replication or long recovery times. We propose a new programming model, discretized streams (D-Streams), that offers a high-level functional API, strong consistency, and efficient fault recovery. D-Streams support a new recovery mechanism, parallel recovery of lost state, that improves efficiency over the traditional replication and upstream backup schemes in streaming databases; unlike previous systems, D-Streams also mitigate stragglers. We implement D-Streams as an extension to the Spark cluster computing engine that lets users seamlessly intermix streaming, batch and interactive queries. Our system can process over 60 million records/second at sub-second latency on 100 nodes.
Similar resources
Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters
Many important “big data” applications need to process data arriving in real time. However, current programming models for distributed stream processing are relatively low-level, often leaving the user to worry about consistency of state across the system and fault recovery. Furthermore, the models that provide fault recovery do so in an expensive manner, requiring either hot replication or lon...
Large-Scale Online Expectation Maximization with Spark Streaming
Many “Big Data” applications in Machine Learning (ML) need to react quickly to large streams of incoming data. The standard paradigm nowadays is to run ML algorithms on frameworks designed for batch operations, such as MapReduce or Hadoop. By design, these frameworks are not a good match for low-latency applications. This is why we explore using a new, recently proposed model for large-scale st...
Replication Schemes to Support Failure Resilient Processing of Real Time Data Streams
In this paper we explore the use of replication for fault tolerant processing of streams. We perform these experiments in the context of the Granules stream processing system that is designed for real time processing of data streams generated by devices and instruments. In this paper we explore well-known replication schemes for fault tolerant processing of data streams. We analyze two basic ap...
WAVES: Big Data Platform for Real-time RDF Stream Processing
Processing data as they arrive has recently gained momentum to mine continuous, high-volume and unbounded sequence of data streams. Due to the heterogeneity and the multi-modality of this data, RDF is widely used to provide a unified metadata layer in streaming context. In response to this ever-increasing demand, a number of systems and languages were produced, aiming at RDF stream processing (...
Robust Security Mechanisms for Data Streams Systems
Stream database systems are designed to support the fast on-line processing that characterizes many new emerging applications such as pervasive computing, sensor-based environments, on-line business processing and network monitoring. The sensitive nature of the data and the high-demands environment where data can be lost or dropped because of limited buffer storage or real-time constraints, req...